Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

Zhao, Yuan, Liu, Rui, Cong, Gaoxiang

arXiv.org Artificial Intelligence

Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in the context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show that our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.
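The attention-based aggregation step described above can be illustrated with a minimal sketch: a current-sentence feature vector attends over context feature vectors and pools them into a single weighted summary. This is a hypothetical simplification in plain Python (the vector dimensions, query/key naming, and the use of scaled dot-product attention are assumptions, not the paper's exact design):

```python
import math

def attention_pool(query, keys):
    """Aggregate context feature vectors into one vector via
    scaled dot-product attention against a query vector.
    Hypothetical simplification of attention-based aggregation."""
    d = len(query)
    # Similarity of the query to each context vector, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the context vectors
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(d)]

# A current-sentence feature attends over two context-sentence features;
# the context vector more similar to the query receives more weight.
pooled = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the paper's setting this kind of pooling would be applied per modality and per scale before the graph-attention fusion; the sketch only shows the core weighting mechanism.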


Continuous Speech Recognition using EEG and Video

Krishna, Gautam, Carnahan, Mason, Tran, Co, Tewfik, Ahmed H

arXiv.org Machine Learning

In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition. Our results demonstrate that EEG features are helpful in enhancing the performance of continuous visual speech recognition systems. In recent years there has been a lot of interesting work done in the fields of lip reading and audio-visual speech recognition. In [1] the authors demonstrated end-to-end sentence-level lip reading, and in [2] the authors demonstrated deep-learning-based end-to-end audio-visual speech recognition.
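The CTC decoding rule at the heart of such an end-to-end model can be sketched in a few lines: after taking the per-frame argmax over the network's label distribution, consecutive repeats are merged and blank symbols are dropped. The label ids and blank index below are hypothetical; this shows only the standard CTC collapse step, not the paper's full model:

```python
def ctc_collapse(frame_labels, blank=0):
    """Greedy CTC decoding step: merge consecutive repeated
    labels, then remove blanks. frame_labels is the per-frame
    argmax sequence; blank=0 is an assumed blank-symbol id."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous
        # frame's label and is not the blank symbol
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame argmax labels -> collapsed output label sequence
seq = ctc_collapse([0, 1, 1, 0, 2, 2, 2, 0, 1])
# yields [1, 2, 1]
```

In a full system the same collapse is applied regardless of whether the frame-level features come from video alone or from concatenated EEG and video streams; the fusion happens earlier, at the feature level.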